Accuracy alone is sometimes not informative enough. Consider predicting a rare disease where the dataset is 1% positive and 99% negative: a model that blindly predicts negative for everything already achieves 99% accuracy.
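This trap is easy to demonstrate in a few lines (a minimal sketch; the 1%-positive synthetic labels are my own):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives
y_pred = np.zeros_like(y_true)                    # always predict negative

print(accuracy_score(y_true, y_pred))  # ~0.99, yet the model is useless:
print(recall_score(y_true, y_pred))    # recall is 0.0 -- it never finds a positive
```

This is exactly why the metrics below look at the positive class specifically instead of pooling all predictions together.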
The confusion matrix of a binary classification result:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | $\TP$ (true positive) | $\FN$ (false negative) |
| Actual negative | $\FP$ (false positive) | $\TN$ (true negative) |
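The four counts can be read directly out of sklearn (a sketch with hand-made toy labels). One caveat: `confusion_matrix` orders labels ascending, so with 0/1 labels the layout is `[[TN, FP], [FN, TP]]`, with the positive class in the *last* row and column, unlike the table above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# ravel() flattens [[TN, FP], [FN, TP]] in row-major order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 2 1 1 4
```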
In Chinese, precision and recall are sometimes rendered as 精确率 and 召回率; personally I find 查准率 and 查全率 the better translations.
Precision: of the dates the model predicted, what fraction actually happened.
Recall: of all the dates that actually happened, what fraction were predicted.
$$ \begin{align*} & \quad \mathrm{precision} = \frac{\TP}{\TP + \FP}, \quad \mathrm{recall} = \frac{\TP}{\TP + \FN} \\[4pt] & \quad \mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2 \cdot \TP}{\text{total samples} + \TP - \TN} \end{align*} $$
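A quick numerical check of the formulas above (the toy labels are my own): precision and recall computed from the confusion-matrix counts, and the equivalent F1 form with denominator $\text{total samples} + \TP - \TN$:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fn, fp, tn = 2, 1, 1, 4          # counted by hand from y_true/y_pred

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# agrees with sklearn, and with the alternative denominator form
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
assert abs(f1 - 2 * tp / (len(y_true) + tp - tn)) < 1e-12
```

The last assertion works because $\text{total samples} = \TP + \FP + \FN + \TN$, so the denominator reduces to $2\TP + \FP + \FN$, which is exactly the harmonic-mean form expanded.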
Raw data: tables, images, videos, text, speech, …
Model learning: the core step, learning a mapping used for prediction
Feature engineering: turning raw data into feature representations the model can learn from
```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.DESCR)
# --------------------
# **Data Set Characteristics:**
#
# :Number of Instances: 569
#
# :Number of Attributes: 30 numeric, predictive attributes and the class
#
# :Attribute Information:
#     - radius (mean of distances from center to points on the perimeter)
#     - texture (standard deviation of gray-scale values)
#     - perimeter
#     - area
#     - smoothness (local variation in radius lengths)
#     - compactness (perimeter^2 / area - 1.0)
#     - concavity (severity of concave portions of the contour)
#     - concave points (number of concave portions of the contour)
#     - symmetry
#     - fractal dimension ("coastline approximation" - 1)
#
# The mean, standard error, and "worst" or largest (mean of the three
# worst/largest values) of these features were computed for each image,
# resulting in 30 features. For instance, field 0 is Mean Radius, field
# 10 is Radius SE, field 20 is Worst Radius.
#
#     - class:
#         - WDBC-Malignant
#         - WDBC-Benign
#
# :Summary Statistics:
#
# ===================================== ====== ======
#                                        Min    Max
# ===================================== ====== ======
# radius (mean):                        6.981  28.11
# texture (mean):                       9.71   39.28
# perimeter (mean):                     43.79  188.5
# area (mean):                          143.5  2501.0
# smoothness (mean):                    0.053  0.163
# compactness (mean):                   0.019  0.345
# concavity (mean):                     0.0    0.427
# concave points (mean):                0.0    0.201
# symmetry (mean):                      0.106  0.304
# fractal dimension (mean):             0.05   0.097
# radius (standard error):              0.112  2.873
# texture (standard error):             0.36   4.885
# perimeter (standard error):           0.757  21.98
# area (standard error):                6.802  542.2
# smoothness (standard error):          0.002  0.031
# compactness (standard error):         0.002  0.135
# concavity (standard error):           0.0    0.396
# concave points (standard error):      0.0    0.053
# symmetry (standard error):            0.008  0.079
# fractal dimension (standard error):   0.001  0.03
# radius (worst):                       7.93   36.04
# texture (worst):                      12.02  49.54
# perimeter (worst):                    50.41  251.2
# area (worst):                         185.2  4254.0
# smoothness (worst):                   0.071  0.223
# compactness (worst):                  0.027  1.058
# concavity (worst):                    0.0    1.252
# concave points (worst):               0.0    0.291
# symmetry (worst):                     0.156  0.664
# fractal dimension (worst):            0.055  0.208
# ===================================== ====== ======
#
# :Missing Attribute Values: None
#
# :Class Distribution: 212 - Malignant, 357 - Benign
#
# :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
#
# :Donor: Nick Street
#
# :Date: November, 1995
#
# This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
# https://goo.gl/U2Uwz2
#
# Separating plane described above was obtained using
# Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
# Construction Via Linear Programming." Proceedings of the 4th
# Midwest Artificial Intelligence and Cognitive Science Society,
# pp. 97-101, 1992], a classification method which uses linear
# programming to construct a decision tree. Relevant features
# were selected using an exhaustive search in the space of 1-4
# features and 1-3 separating planes.
#
# The actual linear program used to obtain the separating plane
# in the 3-dimensional space is that described in:
# [K. P. Bennett and O. L. Mangasarian: "Robust Linear
# Programming Discrimination of Two Linearly Inseparable Sets",
# Optimization Methods and Software 1, 1992, 23-34].
#
# This database is also available through the UW CS ftp server:
#
# ftp ftp.cs.wisc.edu
# cd math-prog/cpo-dataset/machine-learn/WDBC/
#
# .. dropdown:: References
#
#   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
#     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
#     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
#     San Jose, CA, 1993.
#   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
#     prognosis via linear programming. Operations Research, 43(4), pages 570-577,
#     July-August 1995.
#   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
#     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
#     163-171.
```
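To tie the pieces together, here is a sketch that runs the full loop on this dataset: a simple feature-engineering step (standardization) plus a classifier, evaluated with precision/recall/F1 rather than accuracy alone. Logistic regression is my own choice here for illustration, not something the dataset description prescribes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scale features first: the raw ranges differ by orders of magnitude
# (e.g. area up to ~4254 vs. fractal dimension below 0.21)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# per-class precision, recall, and F1 in one report
print(classification_report(y_test, model.predict(X_test)))
```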